Spark Transformations and Actions
We can't change DataFrames (they are immutable), so how are we going to process them?
We give the driver instructions about what to do, like 😀
filter(age < 40).groupBy("country").count() — sorry, no em-dash: filter(age < 40).groupBy("country").count(). The driver will then decide how to achieve it from these instructions.
These instructions are called transformations.
Transformation -- a SQL-like operation
```python
import sys

from pyspark.sql import SparkSession

from lib.logger import Log4J
from lib.utils import get_spark_app_config

if __name__ == "__main__":
    conf = get_spark_app_config()
    spark = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

    logger = Log4J(spark)

    if len(sys.argv) != 2:  # fixed typo: sys.argy -> sys.argv
        logger.error("Usage: HelloSpark <filename>")
        sys.exit(-1)

    logger.info("Starting Spark Session")
    conf_out = spark.sparkContext.getConf()  # fixed: sparkContext, not SparkContext

    # ---- Read the DataFrame ----
    spark_df = spark.read \
        .option('header', 'true') \
        .csv(sys.argv[1])

    logger.info("Finished HelloSpark")
    spark.stop()
```
Next, we apply transformations to spark_df:
```python
spark_df.where("Age < 40") \
    .select("Age", "Gender", "Country", "state") \
    .groupBy("Country")
```
We can chain it like this, or break it into intermediate variables.
Here we are creating a graph of operations: all these transformations together build up a
DAG (Directed Acyclic Graph)
There are two types of operations:
- Transformation
- Narrow Dependency Transformation
- Wide Dependency Transformation
- Action
How do we fix this groupBy, since grouping within each partition alone gives an incomplete result?
We do it by combining and repartitioning:
Spark combines the partitions' data, shuffling it so that all rows with the same key land in the same partition before counting.
All of the below are actions 😀
- Read
- Write
- Collect
- Show
A Spark action terminates the transformation DAG and triggers its execution.